INFO 659 Final Project Report

Calibrating Modular Air Quality Sensors

Problem Statement

The goal of this project is to use machine learning techniques to calibrate data collected from Quant-AQ PM modular air-quality sensors against measurements from a gas outdoor sensor. The primary objective is to enable deployment of modular PM air-quality sensors across diverse locations and environmental conditions, with reliable measurements under varying temperature and relative humidity.

Dataset Description

The dataset consists of measurements from Quant-AQ modular air-quality sensors installed at 3101 Market St, Philadelphia. There are three sensors: two PM modular sensors (alpha and beta) and one gas sensor (outdoor). The relevant data variables are described below:

| Column | Units | Description |
|---|---|---|
| timestamp | | Sample timestamp in ISO format |
| timestamp_local | | Timestamp corrected to the time zone defined in the device settings |
| id | | Unique ID of the record |
| sn | | Device serial number |
| sample_rh | | Sample relative humidity |
| sample_pres | mmHg | Sample pressure |
| lat | degree | Latitude of the device |
| lon | degree | Longitude of the device |
| pm1 | µg/m3 | PM1 value |
| pm25 | µg/m3 | PM2.5 value |
| pm10 | µg/m3 | PM10 value |
| pm1_model_id | | ID of the model corresponding to PM1 |
| pm25_model_id | | ID of the model corresponding to PM2.5 |
| pm10_model_id | | ID of the model corresponding to PM10 |

Data Exploration and Visualizations

In exploring the dataset, we renamed certain variables in the alpha and beta sensor data to match the outdoor sensor. We also reset the seconds component of the time-based index to 0: each row holds a 1-minute average concentration, but rows were written at different seconds. To perform a regression, we first need a train-test split so that the model can be evaluated. A few things to note. First, the datetime column is in reverse order. Second, because we are working with three datasets and want to calibrate alpha to the outdoor sensor and beta to the outdoor sensor, we need to pick the timestamps common to all datasets; the differing number of rows in each dataset indicates missing readings. Third, we need to check for null values in the relevant parameters. PM1, PM25 and PM10 had a null row, which we dropped before performing the analysis.
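The alignment steps described above could be sketched roughly as follows. This is a minimal illustration with pandas, not the code used in the project: the function name `align_to_common_minutes` is hypothetical, and it assumes each DataFrame is indexed by timestamp and has `pm1`, `pm25` and `pm10` columns.

```python
import pandas as pd


def align_to_common_minutes(frames, pm_cols=("pm1", "pm25", "pm10")):
    """Floor timestamps to the minute, sort ascending, drop null PM rows,
    and keep only the timestamps shared by every frame.

    Hypothetical helper; assumes each frame has a timestamp index.
    """
    cleaned = []
    for df in frames:
        df = df.copy()
        # Reset the seconds component to 0 (rows are 1-minute averages).
        df.index = pd.to_datetime(df.index).floor("min")
        # The raw data is in reverse chronological order.
        df = df.sort_index()
        # Drop rows with a null PM reading.
        df = df.dropna(subset=list(pm_cols))
        # Guard against two rows landing on the same minute.
        df = df[~df.index.duplicated(keep="first")]
        cleaned.append(df)

    # Keep only timestamps present in every dataset.
    common = cleaned[0].index
    for df in cleaned[1:]:
        common = common.intersection(df.index)
    common = common.sort_values()
    return [df.loc[common] for df in cleaned]
```

Calling it with the three sensor frames, for example `align_to_common_minutes([dfAlp, dfBet, dfMod])`, would return three frames with identical indices, which makes the later row-by-row comparison straightforward.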

The RH is not normally distributed, so we used the log of RH as a predictor in our multivariable models. To meet our objective, the outcome variables are the outdoor sensor parameters, in this case the PM1, PM2.5 and PM10 concentrations, while the predictor variables are the corresponding parameters from the alpha and beta sensors, each modeled separately.
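As a rough sketch of how the predictors and outcome described above might be assembled for one PM variable (the helper `build_xy` is hypothetical, and the column names `rh` and `temp` are assumed from the correlation output in this report):

```python
import numpy as np
import pandas as pd


def build_xy(sensor_df, outdoor_df, target="pm25"):
    """Build predictors from a modular sensor and the outcome from the
    outdoor sensor. Hypothetical helper; assumes aligned indices and rh > 0.
    """
    X = pd.DataFrame(
        {
            target: sensor_df[target],           # modular sensor reading
            "log_rh": np.log(sensor_df["rh"]),   # RH is skewed, so use its log
            "temp": sensor_df["temp"],
        }
    )
    # The outcome is the same variable as measured by the outdoor sensor.
    y = outdoor_df[target].rename(f"{target}_outdoor")
    return X, y
```

The same function would be called once per sensor and per PM variable, for example `build_xy(dfAlp, dfMod, "pm1")`.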

Correlations between dfBet and dfMod:
pm1     0.969138
pm25    0.968789
pm10    0.730997
rh      0.999919
temp    0.998878
dtype: float64

Correlations between dfAlp and dfMod:
pm1     0.963401
pm25    0.963212
pm10    0.762527
rh      0.999899
temp    0.998542
dtype: float64
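Column-wise correlations like those printed above can be obtained with pandas `DataFrame.corrwith`. The sketch below assumes that `dfAlp`, `dfBet` and `dfMod` share the same timestamp index and column names; whether the report's values were computed exactly this way is an assumption.

```python
import pandas as pd


def sensor_correlations(sensor_df, outdoor_df,
                        cols=("pm1", "pm25", "pm10", "rh", "temp")):
    """Pearson correlation of each shared column between a modular sensor
    and the outdoor sensor. Hypothetical helper; rows are matched on index.
    """
    cols = list(cols)
    return sensor_df[cols].corrwith(outdoor_df[cols])
```

For example, `sensor_correlations(dfBet, dfMod)` would return one correlation per column as a pandas Series.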

Methodology

A detailed summary of the methodology includes:

  1. Data Collection:

    • Gather historical data from the Quant-AQ PM modular air-quality sensors and the gas outdoor sensor. Include parameters such as PM1, PM2.5, PM10, temperature, relative humidity, and gas concentrations.
  2. Data Preprocessing:

    • Handle missing values, outliers, and anomalies in the dataset.
    • Perform feature engineering to extract relevant information from the timestamp and create additional features if needed.
    • Normalize or scale the data to ensure uniformity across different features.
  3. Feature Selection:

    • Identify key features that significantly impact air quality measurements, considering correlations and domain knowledge.
    • Eliminate irrelevant or redundant features to enhance model efficiency.
  4. Model Selection:

    • Choose machine learning models suitable for regression tasks. The methods used in this report were:
      • Simple Linear Regression
      • Random Forest
      • XGBoost
      • Multiple Linear Regression (MLR)
  5. Model Training:

    • Split the dataset into training and validation sets. In this case, the analysis was done on a train-test split with 20% test data.
    • Train each selected model using the training set, considering parameters like temperature and relative humidity.
  6. Calibration and Prediction:

    • Apply the trained models to calibrate Quant-AQ PM modular sensor data against the gas outdoor sensor.
    • Evaluate the calibrated models on the training set using a time-series plot and R-squared values to ensure accurate predictions.
  7. Performance Evaluation:

    • Assess model performance using metrics such as R-squared, mean, and p-values.
    • Use one-to-one plots to visualize the fit.
    • Investigate the impact of temperature and relative humidity on calibration accuracy.
  8. Optimization:

    • Fine-tune models based on performance evaluations and insights gained during calibration.
    • Optimize hyperparameters
  9. Validation and Testing:

    • Validate the final models on an independent test set to ensure generalizability.
    • Assess the models' performance with the test data.

This methodology aims to address the calibration challenge for Quant-AQ PM modular air-quality sensors, providing a robust and adaptable solution for accurate air quality measurements across different settings.
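The model training and evaluation steps in the list above could look roughly like the following with scikit-learn. This is a sketch under assumptions rather than the project's actual code: the function name `fit_and_score`, the `n_estimators` value and the random seed are illustrative, and only the 20% test fraction comes from this report. An XGBoost regressor (`xgboost.XGBRegressor`) exposes the same `fit`/`predict` interface and could be added to the dictionary in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, test_size=0.2, seed=0):
    """Fit a multiple linear regression and a random forest on a train split
    and report R-squared on the held-out 20% test split.

    Hypothetical helper; hyperparameters here are illustrative defaults.
    """
    # Note: train_test_split shuffles rows by default. For time-ordered data,
    # passing shuffle=False gives a chronological split instead.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    models = {
        "mlr": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=200, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return models, scores
```

Here `X` would hold the modular sensor reading, log RH and temperature, and `y` the matching outdoor sensor reading. The returned fitted models can then be applied to the full series to produce calibrated time-series plots.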

A time series of the training data:

PM₁ P-Value Alpha: 2.719736226298462e-51
PM₁ P-Value Beta: 8.097760848647828e-07
PM₂.₅ P-Value Alpha: 9.849494170512791e-49
PM₂.₅ P-Value Beta: 1.5673137466827215e-05
PM₁₀ P-Value Alpha: 2.2170139363403661e-16
PM₁₀ P-Value Beta: 8.281707208895426e-06

Time series calibration from linear regression:

PM₁ P-Value Alpha: 4.3904015529516365e-06
PM₁ P-Value Beta: 0.08180089864504025
PM₂.₅ P-Value Alpha: 2.264099930127344e-06
PM₂.₅ P-Value Beta: 0.06224899867334791
PM₁₀ P-Value Alpha: 0.8967040709134199
PM₁₀ P-Value Beta: 0.21997416167439787

Calibration time series on the test data for Random Forest. In addition to the variable of interest from the alpha or beta sensor, the predictors included log RH and temperature; the target was the same variable of interest from the gas outdoor sensor:

PM₁ P-Value Alpha: 3.1191276275135696e-41
PM₁ P-Value Beta: 4.0175697383781956e-06
PM₂.₅ P-Value Alpha: 1.103623188723264e-40
PM₂.₅ P-Value Beta: 1.164108947121631e-05
PM₁₀ P-Value Alpha: 0.00012038761616026669
PM₁₀ P-Value Beta: 0.05300182382762207

Calibration output for XGBoost, using the same predictor variables as Random Forest and MLR:

PM₁ P-Value Alpha: 0.23020254308289304
PM₁ P-Value Beta: 0.7554758642896344
PM₂.₅ P-Value Alpha: 0.3210355868051801
PM₂.₅ P-Value Beta: 0.7269915190286456
PM₁₀ P-Value Alpha: 1.748584892922372e-07
PM₁₀ P-Value Beta: 5.140038709752224e-06

Calibration output for MLR:

PM₁ P-Value Alpha: 0.07217912994271163
PM₁ P-Value Beta: 0.9829651276451719
PM₂.₅ P-Value Alpha: 0.08110162431725541
PM₂.₅ P-Value Beta: 0.921411076940686
PM₁₀ P-Value Alpha: 0.15980401241363587
PM₁₀ P-Value Beta: 0.07274334506097305

Major Challenges and Solutions

1. Data Variability:

Air quality sensor readings are highly susceptible to fluctuations caused by dynamic environmental conditions. Changes in temperature and humidity levels can influence the accuracy and consistency of sensor measurements. This variability poses a challenge in developing a robust calibration model that can effectively adapt to diverse environmental scenarios. The challenge of data variability arises from the fact that air quality is inherently linked to environmental conditions. For instance, pollutant concentrations often change with fluctuations in temperature and humidity. These variations may not follow a linear pattern, making it challenging to create a one-size-fits-all calibration model. As a result, the model needs to discern between genuine changes in air quality and those induced by external environmental factors.

Potential Impact: Failure to address data variability can lead to inaccurate calibration, resulting in unreliable air quality measurements. It may also hinder the model's generalization capability when deployed in locations with different environmental characteristics.

2. Sensor Drift:

Sensor drift refers to the gradual deviation of sensor readings from their initial calibrated state over time. In the context of modular air quality sensors, this drift introduces uncertainties and errors in measurements, impacting the reliability of the collected data. Modular sensors, despite initial calibration, may experience gradual shifts in their performance characteristics. Factors like sensor aging, exposure to environmental elements, or changes in internal components can contribute to this drift. If not accounted for, sensor drift can lead to systematic errors in measurements, rendering the calibration less effective over time.

Potential Impact: Unmitigated sensor drift can compromise the accuracy and longevity of the calibration model. This challenge necessitates continuous monitoring and recalibration strategies to correct for any deviations and maintain the reliability of the sensor data.

3. Limited Training Data:

Calibrating machine learning models requires a substantial amount of labeled training data. In the case of air quality sensors, the availability of such data may be limited, hindering the model's learning capacity. Creating an effective calibration model relies on exposing it to diverse scenarios through labeled training data. However, obtaining a comprehensive dataset that encompasses various environmental conditions and locations can be challenging. Limited training data may lead to a model that struggles to generalize well, particularly in unique deployment settings. In this project we used only October data, which is itself a limited sample.

Potential Impact: The model's inability to generalize due to limited training data may result in suboptimal performance, especially in environments not well-represented in the training set. This challenge underscores the importance of exploring techniques like transfer learning and data augmentation to enhance model adaptability.

Conclusions and Future Work

The MLR model performed poorly compared with the other models and did not produce statistically significant results (all p-values were greater than 0.05), which was also apparent in the MLR calibrated time-series plots. The Random Forest produced statistically significant results across both sensors and all variables, with the exception of PM10 for the beta sensor, which had a p-value of 0.053; this result is still statistically significant at the 0.1 level. As the correlation analysis showed, PM10 from the modular sensors is more weakly correlated with the gas sensor (about 0.73 and 0.76) than PM1 and PM2.5 (about 0.96 to 0.97). Overall, the Random Forest model performed best among all models, given the p-values and the time-series plots. XGBoost, however, performed better for PM10, with statistically significant results at all conventional significance levels, but seemed to perform poorly for PM1 and PM2.5.